Abstract:
Predictive maintenance of technical systems and their components is a promising approach to optimize system usage and reduce system downtime. Various sensor data are logged during system operation for different purposes, but they are sometimes not directly related to the degradation of a specific component. Variable selection algorithms are necessary to reduce model complexity and improve the interpretability of diagnostic and prognostic algorithms. This paper presents a forest-based variable selection algorithm that measures the importance of a variable by analyzing its distribution in the decision tree structure, called the Variable Depth Distribution. The proposed algorithm is developed for datasets with correlated variables, which pose problems for existing forest-based variable selection methods. The method is evaluated and analyzed using three case studies: survival analysis of lead-acid batteries in heavy-duty vehicles, engine misfire detection, and a simulated prognostics dataset. The results show the usefulness of the proposed algorithm, with respect to existing forest-based methods, and its ability to identify important variables in different applications. As an example, the battery prognostics case study shows that similar predictive performance is achieved when only 17% of the variables are used compared to all measured signals.
Abstract:
Random Forests (RFs) and Gradient Boosting Machines (GBMs) are popular approaches for habitat suitability modelling in environmental flow assessment. However, both present some limitations theoretically solved by alternative tree-based ensemble techniques (e.g. conditional RFs or oblique RFs). Among them, eXtreme Gradient Boosting (XGBoost) has proven to be another promising technique that mixes subroutines developed for RFs and GBMs. To inspect the capabilities of these alternative techniques, RFs and GBMs were compared with conditional RFs, oblique RFs and XGBoost by modelling, at the micro-scale, the habitat suitability for the invasive bleak (Alburnus alburnus L.) and pumpkinseed (Lepomis gibbosus L.). XGBoost outperformed the other approaches, particularly conditional and oblique RFs, although there were no statistical differences with standard RFs and GBMs. The partial dependence plots highlighted the lacustrine origins of pumpkinseed and the preference of bleak for lentic habitats. However, the latter showed a larger tolerance for rapid microhabitats found in run-type river segments, which is likely to hinder the management of flow regimes to control its invasion. The difference in the computational burden and, especially, the characteristics of datasets on microhabitat use (low data prevalence and high overlap between categories) led us to conclude that, in the short term, XGBoost is not destined to replace properly optimised RFs and GBMs in the process of habitat suitability modelling at the micro-scale.
Abstract:
Decision trees are widely used predictive models in machine learning. Recently, -tree was proposed, where the original discrete feature space is expanded by generating all orderings of values of discrete attributes and these orderings are used as the new attributes in decision tree induction. Although -tree performs significantly better than the proper one, its exponential time complexity can prohibit its use. In this brief, we propose -forest, an extension of random forest, where a subset of features is selected randomly from the induced discrete space. Simulation results on 17 data sets show that the novel ensemble classifier has a significantly lower error rate than the random forest based on the original feature space.
Abstract:
This study examines backcountry visitors' preferences for truly ancient forest ecosystems. We find that visitors consider ancient forests a distinctly different ecosystem than mature but younger forests dominated by the same tree types, and that the recreational value of forests continues to grow for several hundred years following a crown fire. By employing a random coefficients model of utility, the analysis is able to provide measures of the variability in preferences for forest ecosystems across the population of users. The model also shows that site choice probabilities and welfare effects associated with ancient woodlands are sensitive to the mix of dominating tree types, and exhibit substantial fluctuation over trails.
Abstract:
Decision support systems play an important role in decision making and may use data mining techniques to solve problems. Astronomy is an area where data mining has been playing a major role. Because astronomical data sets are very large, the classification of celestial bodies is the main concern. To improve classification accuracy, an improved weighted random forest algorithm is suggested. A decision support system is designed using the weighted random forest algorithm, which is implemented in Java. Weighted random forest is observed to perform better than random forest and other tree-based data mining classification techniques.
Abstract:
In QSAR, a statistical model is generated from a training set of molecules (represented by chemical descriptors) and their biological activities. We will call this traditional type of QSAR model an "activity model". The activity model can be used to predict the activities of molecules not in the training set. A relatively new subfield for QSAR is domain applicability. The aim is to estimate the reliability of prediction of a specific molecule on a specific activity model. A number of different metrics have been proposed in the literature for this purpose. It is desirable to build a quantitative model of reliability against one or more of these metrics. We can call this an "error model". A previous publication from our laboratory (Sheridan J. Chem. Inf. Model., 2012, 52, 814-823.) suggested the simultaneous use of three metrics would be more discriminating than any one metric. An error model could be built in the form of a three-dimensional set of bins. When the number of metrics exceeds three, however, the bin paradigm is not practical. An obvious solution for constructing an error model using multiple metrics is to use a QSAR method, in our case random forest. In this paper we demonstrate the usefulness of this paradigm, specifically for determining whether a useful error model can be built and which metrics are most useful for a given problem. For the ten data sets and for the seven metrics we examine here, it appears that it is possible to construct a useful error model using only two metrics (TREE-SD and PREDICTED). These do not require calculating similarities/distances between the molecules being predicted and the molecules used to build the activity model, which can be rate-limiting.
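The two metrics named in the abstract above can be illustrated with a minimal sketch. It assumes that TREE-SD is the standard deviation of the individual tree predictions in a random forest and that PREDICTED is the ensemble prediction itself; the abstract does not define either metric, so both readings are assumptions, as is the choice of the population standard deviation. The function name `error_model_metrics` and the per-tree values are hypothetical.

```python
from statistics import mean, pstdev


def error_model_metrics(per_tree_predictions):
    """Return (PREDICTED, TREE-SD) for one molecule.

    Assumption: PREDICTED is the mean of the per-tree predictions and
    TREE-SD is their (population) standard deviation.
    """
    predicted = mean(per_tree_predictions)
    tree_sd = pstdev(per_tree_predictions)
    return predicted, tree_sd


# Hypothetical per-tree activity predictions for two molecules.
tight = [6.1, 6.0, 6.2, 6.1]    # trees agree closely
spread = [4.0, 7.5, 6.0, 8.1]   # trees disagree

pred_tight, sd_tight = error_model_metrics(tight)
pred_spread, sd_spread = error_model_metrics(spread)

# Larger disagreement among trees gives a larger TREE-SD.
print(sd_tight < sd_spread)
```

Under these assumptions, neither value needs a similarity or distance calculation against the training molecules, which is consistent with the abstract's closing remark.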
Abstract:
Over the past few decades, the remarkable prediction capabilities of ensemble methods have been used within a wide range of applications. Maximizing the accuracy and diversity of the base models is the key to the heightened performance of these methods. One way to achieve diversity when training the base models is to generate artificial (synthetic) instances and incorporate them with the original instances. Recently, the mixup method was proposed for improving the classification power of deep neural networks (Zhang, Cisse, Dauphin, and Lopez-Paz, 2017). The mixup method generates artificial instances by combining pairs of instances and their labels; these new instances are then used to train the neural network, which promotes its regularization. In this paper, new regression tree ensembles trained with mixup, which we will refer to as Mixup Regression Forest, are presented and tested. The experimental study with 61 datasets showed that the mixup approach improved the results of both Random Forest and Rotation Forest. (C) 2020 Elsevier Ltd. All rights reserved.
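The idea of combining pairs of instances and their labels can be sketched as follows for a regression setting. This is only an illustration of the general mixup recipe, with a mixing weight drawn from a Beta(alpha, alpha) distribution; how the Mixup Regression Forest paper integrates such instances into tree training is not stated in the abstract, and the function name `mixup_instances`, the default `alpha`, and the toy data are hypothetical.

```python
import random


def mixup_instances(X, y, n_new, alpha=0.2, seed=0):
    """Generate n_new synthetic (features, target) pairs.

    Each synthetic instance is a convex combination of two randomly
    chosen original instances and of their targets, using the same
    mixing weight for the features and the target.
    """
    rng = random.Random(seed)
    new_X, new_y = [], []
    for _ in range(n_new):
        i, j = rng.randrange(len(X)), rng.randrange(len(X))
        lam = rng.betavariate(alpha, alpha)  # mixing weight in [0, 1]
        new_X.append([lam * a + (1 - lam) * b for a, b in zip(X[i], X[j])])
        new_y.append(lam * y[i] + (1 - lam) * y[j])
    return new_X, new_y


# Toy data (hypothetical): the target equals the first feature plus one.
X = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
y = [1.0, 3.0, 5.0]

aug_X, aug_y = mixup_instances(X, y, n_new=5)
print(len(aug_X), len(aug_y))
```

The synthetic pairs would then be appended to the original training set before fitting each base regressor; drawing a different sample for each base model is one way to introduce the diversity the abstract refers to.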
Abstract:
Although the methods of bagging and random forests are some of the most widely used prediction methods, relatively little is known about their algorithmic convergence. In particular, there are not many theoretical guarantees for deciding when an ensemble is "large enough", so that its accuracy is close to that of an ideal infinite ensemble. Due to the fact that bagging and random forests are randomized algorithms, the choice of ensemble size is closely related to the notion of "algorithmic variance" (i.e., the variance of prediction error due only to the training algorithm). In the present work, we propose a bootstrap method to estimate this variance for bagging, random forests and related methods in the context of classification. To be specific, suppose the training dataset is fixed, and let the random variable ERR_t denote the prediction error of a randomized ensemble of size t. Working under a "first-order model" for randomized ensembles, we prove that the centered law of ERR_t can be consistently approximated via the proposed method as t → ∞. Meanwhile, the computational cost of the method is quite modest, by virtue of an extrapolation technique. As a consequence, the method offers a practical guideline for deciding when the algorithmic fluctuations of ERR_t are negligible.
Abstract:
The objective of this work was to apply the random forest (RF) algorithm to the modelling of the aboveground carbon (AGC) stock of a tropical forest by testing three variable selection procedures: recursive removal and single-objective and multi-objective genetic algorithms (GAs). The data covered 1,007 plots sampled in the Rio Grande watershed, in the state of Minas Gerais, Brazil, and 114 environmental variables (climatic, edaphic, geographic, terrain, and spectral). The best variable selection strategy, RF with the multi-objective GA, reaches the lowest quadratic error of 17.75 Mg ha-1 with only four spectral variables (normalized difference moisture index, correlation texture of the normalized burn ratio 2 index, tree cover, and latent heat flux), which represents a 96.5% reduction in the size of the database. Variable selection strategies help to obtain better RF performance by improving accuracy and reducing data volume. Although recursive removal and the multi-objective GA show similar performance as variable selection strategies, the latter yields a smaller subset of variables with higher accuracy. The findings of this work highlight the importance of using near-infrared, short wavelengths, and derived vegetation indices for remote sensing-based AGC estimation. MODIS products show a significant relationship with the AGC stock and need to be better explored by the scientific community for modelling this stock.
Abstract:
Quantifying forest structure is important for sustainable forest management, as it relates to a wide variety of ecosystem processes and services. Lidar data have proven particularly useful for measuring or estimating a suite of forest structural attributes such as canopy height, basal area, and LAI. However, the potential of this technology to characterize forest succession remains largely untested. The objective of this study was to evaluate the use of lidar data for characterizing forest successional stages across a structurally diverse, mixed-species forest in Northern Idaho. We used a variety of lidar-derived metrics in conjunction with an algorithmic modeling procedure (Random Forests) to classify six stages of three-dimensional forest development and achieved an overall accuracy > 95%. The algorithmic model presented herein developed ecologically meaningful classifications based upon lidar metrics quantifying mean vegetation height and canopy cover, among others. This study highlights the utility of lidar data for accurately classifying forest succession in complex, mixed coniferous forests, but further research should be conducted to classify forest successional stages across different forest types. The techniques presented herein can be easily applied to other areas. Furthermore, the final classification map represents a significant advancement for forest succession modeling and wildlife habitat assessment.